Abstract

A fundamental problem that any scalable multiprocessor must address is the ability to
tolerate high latency memory operations. This paper explores the extent to which multiple
hardware contexts per processor can help to mitigate the negative effects of high latency. In
particular, we evaluate the performance of a directory-based cache coherent multiprocessor
using memory reference traces obtained from three parallel applications. We explore the
case where there are a small fixed number (2-4) of hardware contexts per processor and
the context switch overhead is low. In contrast to previously proposed approaches, we also
use a very simple context-switch criterion, namely a cache miss or a write-hit to shared
data. Our results show that the effectiveness of multiple contexts depends on the nature
of the applications, the context switch overhead, and the inherent latency of the machine
architecture. Given reasonably low overhead hardware context switches, we show that two
or four contexts can achieve substantial performance gains over a single context. For one
application, the processor utilization increased by about 65% with two contexts and by about
100% with four contexts.

1  Introduction 

As shared-memory multiprocessors are scaled (the number of
processors is increased), there will invariably be an increase
in the latency of memory operations. While local memory
references need not have higher latency, remote memory
operations will encounter higher latency because of the larger
physical size of the machine, if not for any other reason.
Consequently, there will always be times when a processor sits
idle, waiting for some remote operation to complete [2,11]. If
more than one context resides on each processor, and context
switch overhead is low, this idle time can be used by
additional contexts. Typically each context corresponds to a
process from one parallel program.

In this paper, we evaluate the utility of multiple contexts
per processor for a directory-based cache coherent
multiprocessor [1]. While the idea of using multiple hardware
contexts per processor is itself not new, we believe our scheme is
simpler to implement than other proposals [4,8,11,19,21]
(discussed in Section 5). In our scheme, each processor contains
a small fixed number (2-4) of hardware contexts with
independent register sets to enable short context switch times.
We also use a very simple context switch criterion, which is
to switch contexts on a cache miss or on a write-hit to
read-shared data or when a watchdog counter of 1000 expires.*
This simple scheme helps keep context switch overhead low,
because the decision to switch or not can be made in a single
cycle.

Our multiple context scheme is evaluated using multiprocessor
memory-reference traces obtained from three applications
[13,16,20]. The results indicate that multiple contexts
can achieve substantial gains in processor utilization. In some
cases processor utilization is increased by 63% with two
contexts and by 100% with four contexts.

The  rest  of  the  paper  is  organized  as  follows.  The  next 
section  presents  the  architecture  and  simulator  used  in  this 
study.  We  also  introduce  the  applications  and  the  method 
employed  to  gather  the  reference  traces.  Section  3  gives  gen¬ 
eral  results  for  the  three  applications.  After  that  we  present 
a  number  of  issues  concerning  multiple  contexts.  This  section 
also  gives  the  results  of  the  simulations.  Finally,  we  have  the 
related  work,  discussion  and  conclusion  sections. 

2  Architectural  Assumptions 
and  Simulation  Environment 

In this section, we discuss the architectural assumptions that
we  make  and  describe  the  simulation  environment  that  we 
used  to  obtain  our  results.  We  also  describe  the  applications 
used  in  this  study  and  the  performance  metric  employed  to 
evaluate  the  multiple  context  scheme. 

2.1  Base  Architecture  and  Simulator 

Figure 1 shows the basic architecture that we assume in this
paper. The architecture consists of several nodes linked
together by an interconnection network. Each node has a
processor, a physical cache, and its share of the global memory.
It is connected to the network through the directory (DIR)
and network interface (N.I.). The processors may have one
or more contexts. The caches are kept consistent using a
directory-based cache coherence protocol as discussed in [1].
We study the performance as a function of several parameters
such as the number of contexts, the context switch overhead,
the latency of the network, and so on. Performance results
as a function of the above parameters are given in Section 4.

*The watchdog counter is introduced to prevent one context
from hogging a particular processor. This ensures that no context
runs for longer than 1000 cycles at a time, preventing starvation
and deadlocks.




Exploring  the  Benefits  of  Multiple  Hardware  Contexts  in  a 
Multiprocessor  Architecture:  Preliminary  Results 

Wolf-Dietrich  Weber  and  Anoop  Gupta 
Computer  Systems  Laboratory 
Stanford  University 
Stanford,  CA  94305 


November  16,  1988 




Figure  1:  Architectural  model 


We use a trace-driven simulator, written by Truman Joe
at Stanford, that emulates the above architecture to evaluate
the effectiveness of multiple contexts. In the single context
per processor case, the simulator works as follows. Before
starting the simulation, we first divide the interleaved
reference stream generated by the tracing program into separate
streams for individual processors. Then, one reference stream
is assigned to each of the processors. At every simulated clock
cycle, each active processor reads the next reference from its
associated reference stream. If the reference hits in the cache²
the processor remains active and will issue another reference
from the stream on the next clock tick. However, if it misses
or a write to read-shared data occurs, it context switches.
The cache sends a request over the network to fetch the
missing line and/or update the state of the other caches in the
system. During the period of time that the cache request is
waiting to be satisfied, the processor remains in a suspended
state and does not generate any more references.

In case of multiple contexts per processor, we have multiple
memory reference streams associated with each processor,
one for each context. At any given time only one of these
contexts is active and the memory references come from that
stream. However, when the active context enters the
suspended state due to a cache miss or a write hit on read-shared
data, a context switch occurs. The processor stays idle for
the time required to perform the context switch. After that,
memory references are issued from the newly activated
context. If more than one context is ready when the active
context blocks, a round-robin scheduling scheme decides which
context is to be activated next.
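
The scheduling just described can be sketched for a single processor as follows. This is only an illustrative toy and not the simulator used in this study: it assumes a fixed miss latency, ignores contention and the watchdog counter, and the names `Context` and `simulate` are hypothetical. Each reference is reduced to a flag that says whether it can be satisfied by the cache.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Context:
    refs: List[bool]   # True = satisfied by the cache, False = needs memory/network
    pos: int = 0       # index of the next reference to issue
    ready_at: int = 0  # cycle at which the outstanding request is satisfied


def runnable(ctx: Context, cycle: int) -> bool:
    """A context can issue if it has references left and is not suspended."""
    return ctx.pos < len(ctx.refs) and ctx.ready_at <= cycle


def simulate(contexts: List[Context], miss_latency: int,
             switch_overhead: int, total_cycles: int) -> float:
    """Return references issued per cycle for one processor (assumed model)."""
    cycle = useful = active = 0
    n = len(contexts)
    while cycle < total_cycles:
        ctx = contexts[active]
        if runnable(ctx, cycle):
            hit = ctx.refs[ctx.pos]
            ctx.pos += 1
            useful += 1            # issuing a reference takes one cycle
            cycle += 1
            if hit:
                continue           # stay on the active context
            ctx.ready_at = cycle + miss_latency   # context is now suspended
        # Active context is blocked: look for another ready one, round-robin.
        others = [(active + k) % n for k in range(1, n)]
        nxt = next((i for i in others if runnable(contexts[i], cycle)), None)
        if nxt is None:
            cycle += 1             # nothing ready, the processor idles
        else:
            cycle += switch_overhead
            active = nxt
    return useful / total_cycles


stream = [True, True, False, True]
one = simulate([Context(list(stream))], 4, 1, 8)
two = simulate([Context(list(stream)), Context(list(stream))], 4, 1, 8)
print(one, two)
```

In this small example the second context uses part of the idle time left by the first context's miss, so the two-context run issues more references in the same eight cycles.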

The simulator that we use is quite detailed in that it models
contention for the memory modules, for the bus on which
the memory modules reside, for the directory associated with
each node, and for the interconnection network. It is also
possible to vary the delays associated with each of the above
modules. We note that the interconnection network assumed
in our simulations is a crossbar switch, but it could be any
point-to-point network (e.g., grid [18], butterfly [3], omega
[l3]) depending on the number of processors we wished to
interconnect. For the default parameters that we used (shown
in Table 1), a remote read takes 27 cycles and a remote write
takes  1*J  cycles  with  no  contention.  The  local  operations  take 
li>  and  13  cycles  respectively.  With  contention  these  numbers 
can  grow  to  as  large  as  Kill  cycles  in  our  simulations. 

The simulator is driven by multiprocessor memory reference
traces. Since the traces include 16 reference streams, we
are limited to four processors if we wish to explore four
contexts per processor. For runs with fewer than four contexts,
only some of the reference streams were used. We model the
scaling of the machine architecture to a larger number of
processors by increasing the latency in the underlying network
(see Section 4.3). We also vary the context switch overhead
and the number of contexts per processor. Section 4 will
present the issues involved and the results obtained.

²For writes, the location has to be owned in addition to being
present in the cache.


Operation           Time

Memory Latency      6 cycles
Bus Transfer        4 cycles
Switch Latency      1' cycles
Switch Transfer     4 cycles
Directory Lookup    cycles

Table 1: Default Parameters for Simulator


One inaccuracy in our simulator is that we assume an
infinite cache for each processor.³ Thus, we do not model the
interference in the caches when there are multiple contexts
per processor. It is not clear, though, whether the sharing
of caches is an advantage or a disadvantage. If the caches
are small, interference might be a serious problem. With
fairly large caches, however, the pre-fetch achieved by
contexts working on the same shared data could actually be
beneficial.⁴ The caches in the architecture presented here
are expected to be large as they serve as the main source of
remote code and data.


2.2  Traces  and  Applications 

The multiprocessor traces used in our simulations were
gathered on a VAX 8350, using a combined hardware/software
scheme [5]. Basically, the tracing works as follows. We
spawn as many processes as the application desires under
the control of a master process. The master process then
single steps the application processes in a round-robin
manner. After each step, it records all references made by the
application processes. For each reference, the number of the
processor producing it, the address of the reference and its
type (read/write/ifetch) are recorded. The traces that we use
correspond to 16-processor runs.
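
As a rough illustration of the trace format and of the division of the interleaved trace into one stream per processor before simulation, consider the sketch below. The record layout and the names `Ref` and `split_streams` are assumptions made for this example; the actual on-disk format of the traces is not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Ref(NamedTuple):
    cpu: int    # number of the processor that produced the reference
    addr: int   # address of the reference
    kind: str   # "read", "write" or "ifetch"


def split_streams(trace: Iterable[Ref]) -> Dict[int, List[Ref]]:
    """Divide an interleaved trace into one ordered stream per processor."""
    streams: Dict[int, List[Ref]] = defaultdict(list)
    for ref in trace:
        streams[ref.cpu].append(ref)   # order within each stream is preserved
    return dict(streams)


# Hypothetical three-reference trace from two processors.
trace = [Ref(0, 0x1000, "ifetch"), Ref(1, 0x2000, "read"), Ref(0, 0x1004, "write")]
streams = split_streams(trace)
print(sorted(streams), [r.kind for r in streams[0]])
```

With 16 such streams, four streams can be attached to each of four simulated processors to study four contexts per processor.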

The traces used were obtained from three applications:
LocusRoute, MP3D and P-Thor. LocusRoute [16,17] is a
standard cell global router. While the tasks spawned by it are
quite coarse in granularity (each may execute around 100,000
instructions), its central data structure (a global cost array)
is shared at a fine granularity. MP3D [13] is a 3-dimensional
particle simulator that determines the shock waves generated
by a body flying at high speed in the upper atmosphere. It
uses distributed loops for parallelization (each loop executes
around 250 instructions) and it is a typical example of
parallel scientific code. P-Thor [20] is a parallel logic simulator
that uses the Chandy-Misra distributed simulation algorithm.
Each parallel subtask (a component evaluation) in P-Thor
takes about .Itlll instructions to execute.

³We are working on a new version of the simulator that will
remove this restriction.

⁴Note that in our execution model, several processes from the
same application are using the multiple contexts. Thus the amount
of shared data can be significant.
2.3 Performance Measure

The figure of merit used in evaluating multiple contexts
in this paper is processor efficiency. This is defined as the
number of cycles spent doing useful work over the total
number of cycles. Of course, the maximum is one reference per
processor per cycle for 100% efficiency. The more time the
processors spend idle, waiting for remote reads and writes,
the lower the overall processor efficiency. In our simulations,
we ran the system for a total of 500,000 clock cycles, and then
counted the number of memory references consumed from the
traces to get the efficiency.
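
The efficiency measure can be written down directly. The sketch below assumes, as the definition above suggests, that each consumed reference accounts for one useful cycle on one processor; the function name and the reference count in the example are hypothetical and are not results from this study.

```python
def processor_efficiency(references_consumed: int,
                         total_cycles: int,
                         n_processors: int) -> float:
    """Useful cycles over total cycles, averaged over all processors.

    Assumes one consumed reference corresponds to one useful cycle, so the
    maximum is one reference per processor per cycle (efficiency 1.0).
    """
    return references_consumed / (total_cycles * n_processors)


# Hypothetical run: 4 processors, 500,000 cycles, 1,000,000 references consumed.
print(processor_efficiency(1_000_000, 500_000, 4))
```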

3  General  Results 

In this section we present some general results obtained with
the simulator. These results give an overall idea of the
differences in behavior of the three applications. They also show
the effect of increasing the switch latency on the read and
write latencies seen by the processors. The numbers are for
a 4-processor system with one context per processor. The
tables below give data about the run lengths and latencies
for the three applications. Run length is defined as the
number of simulator cycles between each cache miss.⁵ Read and
write latencies are the number of cycles required to satisfy
the cache miss.

Results for switch latencies of 2 and 16 cycles are presented.
A switch latency of only two cycles is close to the minimum
that can be achieved with any type of network. The switch
latency of 16 represents the latencies that might be expected
in a larger multiprocessor with many more nodes.
Table 2: General application results with switch latency
of 2 cycles

Table 3: General application results with switch latency
of 16 cycles

Both average and median values are given to convey more
information concerning the distribution of the run-lengths
and latencies. Median values are more representative in
characterizing the typical run-length. In LocusRoute, for
example, due to a few very long runs the averages are high even
though the median values are much lower.

⁵Both here and in the rest of the paper, by cache miss we
actually mean references that can not be satisfied by the cache alone
and need to access the memory, or the network, or both. These
include familiar cache misses but also write-hits to read-shared data.
In the latter case, the network needs to be accessed to invalidate
that location from other caches and to gain ownership of that cache
line.

MP3D has the shortest run-length and longest latencies.
There is a lot of global data traffic in MP3D and this leads
to frequent misses, i.e. short run lengths. LocusRoute, on
the other hand, has very long run-lengths. The large size
of the tasks and their relative independence allows for large
portions of code that execute out of the cache without any
misses. The latencies are close to the minimum expected for
this architecture. P-Thor is somewhere in between the other
two applications.

As the switch latency increases, the read and write
latencies grow as well. Reads are affected more because they
require a two-way transaction and so the higher latency is
incurred twice. Run lengths should be unaffected by the
increased latency, but in fact we do see a slight decrease in
run lengths as the switch latency increases. This is
probably due to a cold-start effect of the caches. Run-lengths near
the beginning of the reference streams are shorter on average,
because more cache misses are incurred.

4  Issues  and  Results 

We wish to explore several questions concerning the
performance of multiple contexts:

•  How  many  contexts  are  required  to  achieve  good  pro¬ 
cessor  utilization? 

•  How  does  the  context  switch  overhead  affect  the  per¬ 
formance? 

•  What  is  the  effect  of  increasing  the  switch  latency? 

•  When  to  switch  contexts? 

•  How  much  does  the  performance  vary  with  application? 

This section explores all of these issues and presents results.
We show graphs of processor efficiency. In each graph, we
plot the number of active cycles over the total number of
cycles against the switch latency of the architecture. We show
efficiencies for one, two and four contexts. Different context
switch overheads are presented on different graphs. Figures
2-4 show results for MP3D, Figures 5-7 give results for
P-Thor and Figures 8-10 show results for LocusRoute.

4.1  Number  of  Contexts 

Depending on the single context processor efficiency, it may
or may not be worthwhile to use two, four or more contexts.
Note that the single-processor efficiency is basically a
function of the cache miss rate and the read and write latency for
the architecture. For LocusRoute (Figures 8-10) the
processor efficiency is already very high (about 90%) with a single
context and little performance can be gained by adding more
contexts. As a matter of fact, if the context switch overhead
is high, four contexts do worse than one (Figure 10). MP3D
on the other hand (Figure 2), has single context performance
near 50% and achieves substantial gains with more contexts
(efficiency is 77% with 2, 94% with 4).

As expected, the graphs show diminishing marginal returns
as the number of contexts is increased (see Figure 5 for
example). In every case going from one to two contexts yields a
greater benefit than going from two to four contexts. A small
number of contexts is also preferable because it allows simpler
hardware. With a larger number of contexts, a penalty in the
cycle time of the processor or an increase in context switch
overhead may be inevitable. Also, a large number of contexts
requires a large number of processes. Many applications may
not be able to support such a large number of processes.


Figure 3: MP3D: Context Switch Overhead 4 Cycles

Figure 5: P-Thor: Context Switch Overhead 1 Cycle

Figure 6: P-Thor: Context Switch Overhead 4 Cycles

Figure 7: P-Thor: Context Switch Overhead 16 Cycles

Figure 8: LocusRoute: Context Switch Overhead 1 Cycle

Figure 9: LocusRoute: Context Switch Overhead 4 Cycles

Figure 10: LocusRoute: Context Switch Overhead 16 Cycles

4.2  Context  Switch  Overhead 

The context switch overhead depends on the number of
contexts kept in hardware, the amount of state kept for each
context, and the amount of hardware dedicated to context
switching. We explore context switch overheads of 1, 4 and
16 cycles. A single cycle overhead can be achieved by
keeping multiple copies of the pipeline registers and being able
to swap in the whole state in a single cycle.⁶ If the pipeline
has to be drained and filled, a 4-cycle overhead is reasonable.
Both of these options require multiple register banks, one for
each context. If we want to load and store the registers to
some fast local memory, we have to allow at least 16 cycles.
It is clear that the hardware is more complex if we require the
context switch to be faster. Of course, beyond some overhead
value, multiple contexts do not help any more, since a long
latency operation will complete before the context switch is
achieved.

As expected, the results show that increasing the context
switch overhead reduces the benefit achieved by
having multiple contexts. Note that the single context graph
line is identical for various context switch overheads (see
Figures 2-4 for example), since there is no context switching in
that case. When the context switch overhead is 16, none of
the programs gains much processor efficiency with
increased contexts. MP3D achieves a 12% increase in efficiency
with 4 contexts (Figure 4), P-Thor gains only 5% (Figure 7)
and LocusRoute actually loses 12% (Figure 10). For
multiple contexts to be useful, the context switch overhead will
have to be kept low, preferably on the order of a few cycles.
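
The interaction between run length, latency and switch overhead can be illustrated with a simple deterministic model. This model is only a back-of-the-envelope sketch and is not part of the simulator described in this paper: it assumes every run has the same length, every miss has the same latency, and there is no contention. The function name and all parameter values are hypothetical.

```python
def model_efficiency(run_length: float, latency: float,
                     overhead: float, n_contexts: int) -> float:
    """Idealized processor efficiency under fixed run length and latency.

    With one context the processor simply waits out each miss. With several
    contexts it is either still limited by latency (unsaturated) or pays one
    switch overhead per run (saturated), whichever is worse.
    """
    if n_contexts == 1:
        return run_length / (run_length + latency)
    unsaturated = n_contexts * run_length / (run_length + latency + overhead)
    saturated = run_length / (run_length + overhead)
    return min(unsaturated, saturated)


# Short runs, long latency, cheap switch: extra contexts help a lot.
print(model_efficiency(10, 30, 1, 1), model_efficiency(10, 30, 1, 4))
# Long runs, latency below the switch overhead: extra contexts hurt.
print(model_efficiency(200, 13, 16, 1), model_efficiency(200, 13, 16, 4))
```

Under these assumptions, once the switch overhead exceeds the latency being hidden, the multiple-context efficiency falls below the single-context value, which matches the qualitative observation above.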

4.3  Latency 

The amount of latency incurred in remote operations is
important for the effectiveness of processors with multiple
contexts. With very low latencies, context switch overhead may
be too large to allow multiple contexts to achieve any
performance gain. As the latency increases, the single context
processors do increasingly poorly because more and more
processor time is spent idle. This is where multiple contexts can
help. As seen in Figures 5-7, the relative value of multiple
contexts increases as the latency increases. In other words,
a processor with multiple contexts will suffer less efficiency
degradation due to high latencies than a single context
processor.

One reason for varying switch latency in our evaluation of
multiple contexts is to explore different types of architectures.
A grid network, for example, is expected to have a much
larger latency than a crossbar switch. At the same time the
higher latencies can correspond to larger multiprocessors. As
more processors are added to a parallel machine, the latencies
increase due to deeper networks or more complex switches.
Larger latencies present a greater opportunity for multiple
contexts, because the single context efficiency is lower. At
the same time we note that it is still possible to achieve very
high efficiencies with just a few contexts. For example, with
a switch latency of 16 cycles, latencies are on the order of 50
and 51) cycles for reads and writes respectively (see Section
3). A network large enough to have this high a latency could
well support several hundred processors. Yet processor
efficiencies stay high for this latency (til)'/ for MP3D, •?()% for
P-Thor and tMVT for LocusRoute). The point is that even as
multiprocessors grow and latencies increase, processors with
just a few contexts achieve very good utilization.

⁶Alternatively multiplexors could be used to switch between
multiple pipeline state copies.

4.4  When  to  Switch  Contexts 

Ideally, one would like to switch contexts whenever the
context switch overhead is less than the latency of the operation
being performed. Of course external operations may take
longer or shorter depending on the congestion in the machine,
and there is no easy way to predict how long a given operation
will take. We thus choose the easiest context switch criterion:
switch on any operation that requires a main memory access,
either in the same cluster or remotely. Switching only on
remote operations requires extra hardware, but is a feasible
alternative if context switch overhead is relatively high. If a
context switch takes 16 cycles, and local operations also take
on the order of 10 cycles to complete, it does not make sense
to initiate a context switch on every local operation.

Two of the applications had frequent memory accesses, but
LocusRoute processes had long streaks of executing out of
the cache. In order to prevent one context from hogging a
particular processor we introduce a watchdog counter that
pre-empts the current context after 1000 cycles. This ensures
that no context runs for longer than 1000 cycles at a time,
thus allowing all contexts on a particular processor to make
progress.
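
A watchdog of this kind can be sketched as a small counter. The class below is an illustration only; the name `Watchdog` and the choice to reset the counter on every context switch are assumptions, since the hardware details are not given here.

```python
class Watchdog:
    """Counts cycles run by the current context and signals pre-emption."""

    def __init__(self, limit: int = 1000):
        self.limit = limit   # maximum cycles a context may run uninterrupted
        self.count = 0

    def tick(self) -> bool:
        """Advance one cycle; return True when the context must be pre-empted."""
        self.count += 1
        if self.count >= self.limit:
            self.count = 0
            return True
        return False

    def reset(self) -> None:
        """Assumed to be called whenever a context switch occurs for any reason."""
        self.count = 0


# Small limit used only to make the behavior visible.
w = Watchdog(limit=3)
fired = [w.tick() for _ in range(6)]
print(fired)
```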

4.5  Applications 

The three applications exhibited very different behavior.
LocusRoute and P-Thor have relatively little global traffic,
whereas MP3D has a lot. While 1.5% of LocusRoute
instructions cause references to shared data, this number is close to
12% for MP3D. This explains why the run-lengths presented
in Section 3 are so different for the three applications. At the
same time LocusRoute has very good caching behavior and
very little interference between processes. Thus LocusRoute
achieves very high efficiencies (around 90%) even with
single context processors (see Figures 8-10). Very little can be
gained by adding extra contexts.

P-Thor achieves 50-70% utilization with single contexts
(see Figures 5-7). This can be boosted effectively by adding
more contexts. Not only is efficiency increased as more
contexts are added, but the processors also become more immune
to the effect of high latency operations. This is seen by the
spreading of the curves as the latency increases.

MP3D has a large amount of global traffic. When the
switch latency increases, the switch becomes the bottleneck
and it limits the gains achieved by multiple contexts. While
some performance gain is achieved, the relative benefit of
multiple contexts is greater for lower latencies. Note how the
different context lines converge as the switch latency increases
in Figures J and


5  Related  Work

The idea of multiple hardware contexts per processor in itself
is not new. In this section we discuss how our approach differs
from earlier proposals and present some advantages and
disadvantages. We begin with the Alto personal computer from
Xerox [21] which provided multiple hardware microcode-level
contexts, allowing the CPU to be shared between the
instruction set interpreter and the I/O devices. The contexts were
statically assigned to devices and were not available to
general user processes. The aim of the multiple contexts was to
make the power of the processor readily available for time
critical I/O processing, a task that is frequently delegated to
separate processors in more recent designs. Unlike our
motivation, the issue was not to hide memory latency from a very
fast processor.

The HEP multiprocessor from Denelcor [19] also provided multiple
hardware contexts per processor. Unlike the Alto, the contexts
were available to arbitrary user processes. The processes shared
a large set of registers and on each cycle an instruction from a
different process was executed. A minimum of 8 active processes
(those processes that are not waiting for a memory reference to
complete) were needed to keep the execution pipeline full. The
HEP machine tolerated memory latency well, but its main drawback
was that a single process could get at most 1/8 of the pipelined
processor. In order to keep the pipeline full, a large number of
processes were needed. This is in stark contrast to modern
pipelined processors [6,14] where a single process almost fully
utilizes the pipelined processor. Now the HEP scheme would not be
a problem if all applications could be split into an arbitrarily
large number of processes. However, this is often not possible
in practice, as there may not be enough intrinsic parallelism
in the application [7], or because doing so greatly increases
the amount of overhead.
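
The pipeline-share arithmetic above can be illustrated with a
short sketch. It assumes an idealized barrel-style pipeline of
depth 8 in which each active process has at most one instruction
in flight. The function name `barrel_share` and this simplified
model are illustrative assumptions, not a description of the
actual HEP hardware.

```python
from fractions import Fraction


def barrel_share(active_processes, pipeline_depth=8):
    """Idealized model (an assumption, not the real HEP): each cycle the
    pipeline issues from a different active process, and a process may
    have at most one instruction in the pipeline at a time.

    Returns (pipeline utilization, fraction of the pipeline per process).
    """
    if active_processes < 0 or pipeline_depth <= 0:
        raise ValueError("need active_processes >= 0 and pipeline_depth > 0")
    if active_processes == 0:
        return Fraction(0), Fraction(0)
    # At most one pipeline slot per process can be occupied.
    busy_slots = min(active_processes, pipeline_depth)
    utilization = Fraction(busy_slots, pipeline_depth)
    # The busy slots are shared evenly among the active processes.
    per_process = utilization / active_processes
    return utilization, per_process


for n in (1, 4, 8, 16):
    u, share = barrel_share(n)
    print(f"{n:2d} active processes: pipeline {u}, each process {share}")
```

Under these assumptions the pipeline is only full once 8 or more
processes are active, and no single process ever receives more
than 1/8 of the pipeline, however few processes are running.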

More recently, Iannucci [11] has proposed using multiple contexts
for his hybrid data-flow/von Neumann machine. Each processor
consists of a hardware queue of enabled continuations. The
continuations are very small in size (containing just the
program counter and the frame base-register), and the hardware
can switch between them in a single cycle. However, to make this
single cycle switch possible, processor registers are not saved
on a context switch. Consequently, the software is structured so
that it does not rely on registers being valid between potential
context switch points. The switch points are synchronizing
references, where a read to a location tagged empty results in
that continuation being suspended. In our view, the disadvantages
of Iannucci's approach are the following. First, processes cannot
make full use of the register sets, given that the run-lengths
(the number of instructions executed between switch points) are
very small [11] and registers are not preserved in between. We
believe that extensive use of registers is absolutely critical to
the performance of modern processors [6]. Second, a processor
that supports a large number of continuations (contexts) in
hardware, keeps track of which ones are enabled, and uses a
complex criterion for deciding which continuation to issue the
next instruction from [12] is very complicated. We believe such a
processor will have a significantly more complex pipeline and
much larger area than a simple RISC processor. Consequently, the
cycle time of such a machine would be slower than that of modern
RISC processors. Thus the hybrid machine has to make up the large
factor that it loses over conventional microprocessors before it
becomes competitive. On the other hand, the scheme that we
propose does not lose anything over modern RISC processors. In
fact, it is possible to take multiple commercially available RISC
processor chips (e.g., Motorola 88000 processor and cache chips)
and connect them so as to simulate multiple contexts.

We now consider the MASA architecture proposed by Bert Halstead
[8]. In this architecture each processor has a fixed number of
hardware task frames. Each task frame is capable of storing a
complete process context and consists of a set of auxiliary
registers (like the program counter) and a set of general purpose
registers. Since the number of processes may exceed the number of
task frames, the process contexts are allowed to overflow into
memory.* On each cycle, a context in the enabled or ready state
may issue an instruction. However, once a process issues an
instruction, it cannot issue another instruction until the
previous instruction has completed. Thus, in its current form, a
process on MASA can get only 1/4 (inverse of pipeline depth) of
the pipelined processor's performance. As discussed above for
HEP, this is a major drawback. Halstead and group recognize it
[8] and are exploring ways to remove this restriction.

We now discuss a more subtle but fundamental difference between
the Iannucci and Halstead schemes and our scheme. In our scheme,
the sole purpose of the multiple hardware contexts is to mitigate
the negative effects of memory latency. The number of hardware
contexts needed for a particular machine is fixed and depends
mainly on the expected cache hit ratio and the memory latency for
that architecture. In the Iannucci and Halstead schemes, the
context mechanism is instead made to serve two purposes at the
same time. It is used to mask memory latency as in our scheme,
but it is also used as a hardware task queue. Thus when a
parallel subtask is created, it manifests itself as a new context
that is then managed and scheduled by the hardware. Since the
number of parallel subtasks can be arbitrarily large, mechanisms
are needed and provided to handle overflow of contexts. Also, the
number of contexts that are needed is large. In our scheme, the
issue of subtask management is completely separated and is
handled in software. This permits great flexibility, including
the possibility to schedule tasks in a manner similar to the
Iannucci and Halstead proposals, if a particular application so
warrants.** Thus instead of using full/empty bits and hardware
queuing in I-structure memory [10], we may simulate full/empty
bits in software and switch to a different subtask if a piece of
data is not ready. It is not obvious which scheme works better.
We will be able to tell only when such machines actually get
built.
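
To make the idea of software-simulated full/empty bits concrete,
the following is a minimal sketch, not an implementation from
this work. It assumes subtasks are written as Python generators
that yield the cell they want to read. The names `Cell`, `run`,
`producer`, and `consumer` are hypothetical. A subtask that reads
an empty cell is put back on a software queue and another subtask
runs in its place.

```python
from collections import deque


class Cell:
    """A memory word with a software-maintained full/empty bit."""

    def __init__(self):
        self.full = False
        self.value = None

    def write(self, value):
        self.value = value
        self.full = True


def run(subtasks):
    """Round-robin software scheduler over generator-based subtasks.

    A subtask yields the Cell it wants to read. If that cell is still
    empty, the subtask is requeued and a different subtask is tried.
    """
    queue = deque((task, None) for task in subtasks)  # (subtask, cell awaited)
    blocked_in_a_row = 0
    while queue:
        task, awaited = queue.popleft()
        if awaited is not None and not awaited.full:
            # Data not ready: switch to a different subtask.
            queue.append((task, awaited))
            blocked_in_a_row += 1
            if blocked_in_a_row > len(queue):
                raise RuntimeError("every subtask is waiting on an empty cell")
            continue
        blocked_in_a_row = 0
        try:
            # Start the subtask, or resume it with the value it waited for.
            cell = next(task) if awaited is None else task.send(awaited.value)
            queue.append((task, cell))
        except StopIteration:
            pass  # subtask finished


def producer(src, dst):
    x = yield src          # read the input cell
    dst.write(2 * x)       # fill the cell the consumer is waiting on


def consumer(cell, out):
    value = yield cell     # suspended until the cell is full
    out.append(value)


src, dst, out = Cell(), Cell(), []
src.write(21)
run([consumer(dst, out), producer(src, dst)])
print(out)
```

In this sketch the consumer is scheduled first, finds its cell
empty, and is set aside until the producer has filled it. All of
the queueing policy lives in `run`, so it can be changed per
application without touching any context switch hardware.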

6  Discussion 

This section contains the discussion of several topics that
relate to the evaluation of multiple contexts as presented in
this paper.

One question that we must ask is, what are the real advantages of
having multiple contexts? Since processors are cheap, why not
simply have a larger number of processors in the multiprocessor?
The fallacy in this argument is that, while CPU chips (e.g.,
MC68030 chips) are relatively cheap, a fast processor is not. A
fast processor nowadays has a large amount of cache built out of
expensive and fast SRAMs; in addition, there are expensive
functional units such as floating point ALUs. Furthermore, each
new processor needs an extra port to the network, or to the bus
that it is placed on. The extra port increases the depth of the
network, or the loading on the bus, thus increasing the latency.
Several contexts per processor can share these expensive
resources, thus making more efficient use of them.

* Such overflow and underflow operations are quite expensive,
and care must be taken to minimize them.

** We would normally expect there to be some sort of a
distributed task queue to handle the scheduling of subtasks.

Another question that arises is how the multiple contexts should
be implemented. The multiple contexts do not necessarily have to
be implemented on a single chip. In the case where the size of
each processing node is small, on the order of a few chips [tl],
we need to have several contexts on a single chip using
duplicated register sets. However, having to design a special
processor for a given architecture makes that architecture less
practical. So for larger processing nodes, for example where each
processor occupies a whole board, it may be quite feasible to use
separate processor chips for the different contexts. While
simplifying the hardware design effort, this approach duplicates
not just the register set but all of the data path and control
as well.

There are some software issues to be resolved. In particular,
how do you choose which processes to put on a single processor?
Since the progress of contexts on any one processor is mutually
exclusive, the correct placement of processes on processors may
be important. If a given program section requires several
contexts to be active in order to make progress, it is best to
place these on separate processors.

7  Conclusions 

In scalable multiprocessor architectures, processors with a small
fixed number of contexts can achieve substantially greater
efficiencies than single context processors. In some cases
efficiencies increased 65% with two contexts and 100% with four
contexts. The best improvements are found in architectures with
high latency operations and low context switch overheads. Such
high latency operations are to be expected in large-scale
multiprocessors. Low context switch overheads can be achieved by
having a small fixed number of contexts in hardware and by using
a simple switch criterion: the cache miss.
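
The qualitative trend summarized above can be illustrated with a
simple analytic model. The sketch below is not the trace-driven
simulation used in this paper. It assumes a fixed run length
between switch points, a fixed miss latency, and a fixed switch
overhead, and the function and parameter names are hypothetical.

```python
def utilization(contexts, run_length, latency, switch_cost):
    """Fraction of cycles spent on useful work under a simple
    deterministic model (an assumption, not the paper's simulator).

    run_length  : cycles of useful work between context switch points
    latency     : cycles needed to service the miss that caused a switch
    switch_cost : cycles lost on every context switch
    """
    if contexts < 1 or run_length <= 0 or latency < 0 or switch_cost < 0:
        raise ValueError("invalid parameters")
    if contexts == 1:
        # A single context never switches; it simply stalls on each miss.
        return run_length / (run_length + latency)
    # Latency fully hidden: only the switch overhead is lost.
    saturated = run_length / (run_length + switch_cost)
    # Too few contexts to cover the latency: the processor still idles.
    linear = contexts * run_length / (run_length + switch_cost + latency)
    return min(saturated, linear)


# Illustrative parameter values only; they are not measurements.
for latency in (20, 60, 100):
    row = [utilization(n, 20, latency, 4) for n in (1, 2, 4)]
    print(latency, [round(u, 2) for u in row])
```

In this model the gain from extra contexts grows with the latency
to be hidden, and the ceiling on utilization is set by the ratio
of switch overhead to run length, which is why a cheap switch
matters.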

One important difference between our context switch scheme and
those proposed in [8,11,19] is that in our scheme the context
switch mechanism is separated from the subtask management
mechanism. This makes for simpler and faster hardware and allows
greater flexibility and application-dependent performance tuning.

We are currently working on more detailed simulations, including
the effects of finite caches and cache contention when a miss is
satisfied from memory. We are also looking further into the
issues and details of implementing our multiple context scheme.